This report details the methods used to determine appropriate filter thresholds for SNV calls.
Creating simulated data
At the site level, three major filters were applied to obtain a high-quality variant set:
- QD: variant quality normalized by read depth
- SOR: strand odds ratio
- FS: Fisher strand (strand bias estimated with Fisher's exact test)
To find the optimal filter thresholds, we examined each metric independently and in combination.
Note: filter thresholds were only optimized for SNVs.
- A truth set of variants was created using either a subset of real variants from the 2018 release, or randomly selected positions across the genome. The results and conclusions were essentially the same with both sets of truth variants, so only the former results are shown below.
- The truth set of variants was inserted into the N2 strain BAM (N2.bam) using bamsurgeon (20d431e).
- Variants were called with the wi-gatk-nf pipeline.
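The second way of building a truth set, randomly selected positions across the genome, can be sketched as follows. This is a minimal illustration, not the script used for this report: the function name `sample_truth_positions`, the toy chromosome lengths, and the fixed seed are all assumptions made for the example.

```python
import random


def sample_truth_positions(chrom_lengths, n_sites, seed=0):
    """Sample n_sites distinct 1-based (chrom, pos) pairs.

    Chromosomes are drawn in proportion to their length so that
    positions are spread evenly across the genome.
    """
    total = sum(chrom_lengths.values())
    if n_sites > total:
        raise ValueError("cannot sample more sites than there are positions")

    rng = random.Random(seed)  # fixed seed so the truth set is reproducible
    chroms = list(chrom_lengths)
    weights = [chrom_lengths[c] for c in chroms]

    sites = set()
    while len(sites) < n_sites:
        chrom = rng.choices(chroms, weights=weights, k=1)[0]
        pos = rng.randint(1, chrom_lengths[chrom])  # inclusive on both ends
        sites.add((chrom, pos))
    return sorted(sites)


# Toy lengths for illustration only; these are NOT real chromosome sizes.
toy_lengths = {"I": 1000, "II": 2000}
truth_sites = sample_truth_positions(toy_lengths, n_sites=5)
for chrom, pos in truth_sites:
    print(chrom, pos, sep="\t")
```

Fixing the seed keeps the truth set identical between runs, which matters because the same set is later compared against the variant calls.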
QD filter threshold
To reduce complexity, the filter thresholds were optimized independently.
The optimal QD threshold was determined as follows:
- For each filter threshold, variants were classified as detected or undetected. For example, the table below shows a few rows for the filter threshold QD > 10. Evidence for a variant may be present in the calls, but if QD or another metric fails its filter, the variant is classified as undetected.
| CHROM | POS | QD | sim1_genotype | sim2_genotype | sim3_genotype | ==> | pass_QD_filter | is_detected |
|---|---|---|---|---|---|---|---|---|
| I | 1352 | 110 | 1/1 | 1/1 | 1/1 | QD threshold is 10 | yes | yes |
| I | 2566 | 90 | 1/1 | 0/0 | 1/1 | QD threshold is 10 | yes | yes |
| I | 3847 | 2 | 0/0 | 1/1 | 0/0 | QD threshold is 10 | no | no |
| I | 4975 | 38 | 1/1 | 0/0 | 0/0 | QD threshold is 10 | no | no |
| I | 5590 | 298 | 1/1 | 1/1 | 1/1 | QD threshold is 10 | yes | yes |
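The detected/undetected rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the report's own code: the function name `is_detected` is hypothetical, and the sketch assumes a site counts as detected when at least one simulated sample has a non-reference genotype and the site's QD is strictly above the threshold.

```python
def is_detected(qd, genotypes, qd_threshold=10.0):
    """Return True if the site has variant evidence AND passes the QD filter.

    Assumption: "0/0" (reference) and "./." (missing) do not count as
    evidence for a variant.
    """
    has_alt_call = any(gt not in ("0/0", "./.") for gt in genotypes)
    passes_qd = qd > qd_threshold
    return has_alt_call and passes_qd


# Three rows taken from the table above: (CHROM, POS, QD, genotypes).
rows = [
    ("I", 1352, 110, ["1/1", "1/1", "1/1"]),
    ("I", 2566, 90, ["1/1", "0/0", "1/1"]),
    ("I", 3847, 2, ["0/0", "1/1", "0/0"]),
]
for chrom, pos, qd, genotypes in rows:
    print(chrom, pos, "yes" if is_detected(qd, genotypes) else "no")
```

The row at position 3847 shows the key point: a sample carries a 1/1 call, but the site fails the QD filter and is therefore treated as undetected.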
- Depending on whether it is detected and whether it is in the truth set, each variant falls into one of four categories: true positive, true negative, false positive, or false negative.
| CHROM | POS | is_detected | is_in_truth | category |
|---|---|---|---|---|
| I | 1352 | yes | yes | true positive |
| I | 2566 | yes | no | false positive |
| I | 3847 | no | no | true negative |
| I | 4975 | no | yes | false negative |
| I | 5590 | yes | yes | true positive |
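The mapping from the two yes/no columns to a category is a simple two-way lookup. The sketch below reproduces the table above; the function name `categorize` is a hypothetical helper introduced for illustration.

```python
def categorize(detected, in_truth):
    """Map (detected, in_truth) to one of the four categories."""
    if detected and in_truth:
        return "true positive"
    if detected and not in_truth:
        return "false positive"
    if not detected and in_truth:
        return "false negative"
    return "true negative"


# (POS, is_detected, is_in_truth) for the five example rows above.
example_rows = [
    (1352, True, True),
    (2566, True, False),
    (3847, False, False),
    (4975, False, True),
    (5590, True, True),
]
categories = {pos: categorize(det, truth) for pos, det, truth in example_rows}
for pos, category in categories.items():
    print(pos, category)
```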
- A confusion matrix can then be created using the variant counts for each category.
| | in_truth | not_in_truth |
|---|---|---|
| detected | count of true positive | count of false positive |
| not_detected | count of false negative | count of true negative |
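Tallying the categories into this layout might look like the following. It is a sketch only: `confusion_matrix` is a hypothetical helper, and the input list is the category column of the five example rows shown earlier, not real results.

```python
from collections import Counter


def confusion_matrix(categories):
    """Count categories and arrange them as detected/not_detected rows."""
    counts = Counter(categories)  # missing categories count as 0
    return {
        "detected": {
            "in_truth": counts["true positive"],
            "not_in_truth": counts["false positive"],
        },
        "not_detected": {
            "in_truth": counts["false negative"],
            "not_in_truth": counts["true negative"],
        },
    }


# Categories of the five example variants (illustrative, not real results).
example_categories = [
    "true positive",
    "false positive",
    "true negative",
    "false negative",
    "true positive",
]
matrix = confusion_matrix(example_categories)
for row_name, row in matrix.items():
    print(row_name, row)
```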
- Filter thresholds were chosen to maximize true positive rate and precision, while minimizing false positive rate. Red points indicate the chosen thresholds to use for each filter.
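The three metrics follow directly from the confusion-matrix counts: true positive rate = TP / (TP + FN), precision = TP / (TP + FP), and false positive rate = FP / (FP + TN). The sketch below tabulates them per threshold. The counts are invented for illustration and are not results from this analysis, and the helper name `rates` is hypothetical. No single scoring rule is stated for picking the threshold, so the sketch stops at tabulating the metrics.

```python
def rates(tp, fp, fn, tn):
    """Compute true positive rate, precision, and false positive rate."""
    nan = float("nan")  # returned when a denominator is zero
    return {
        "tpr": tp / (tp + fn) if (tp + fn) else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        "fpr": fp / (fp + tn) if (fp + tn) else nan,
    }


# HYPOTHETICAL counts per QD threshold, for illustration only.
hypothetical_counts = {
    5: {"tp": 95, "fp": 20, "fn": 5, "tn": 80},
    10: {"tp": 90, "fp": 10, "fn": 10, "tn": 90},
    20: {"tp": 70, "fp": 2, "fn": 30, "tn": 98},
}
for threshold, counts in hypothetical_counts.items():
    r = rates(**counts)
    print(
        f"QD > {threshold}: TPR={r['tpr']:.2f} "
        f"precision={r['precision']:.2f} FPR={r['fpr']:.2f}"
    )
```

In this made-up example, raising the threshold lowers the false positive rate but also lowers the true positive rate, which is the trade-off the chosen threshold has to balance.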
SOR threshold
- The same steps as above were taken to find an optimal SOR threshold.
FS threshold
- The same steps were taken to find an optimal FS threshold.
QD, SOR, FS filter thresholds in combination
- A subset of QD / SOR / FS thresholds was examined in combination. Red dots represent the thresholds chosen from the independent analysis of each filter. They indicate that the thresholds identified by examining filters individually closely match those arrived at by examining filters in combination.
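Examining thresholds in combination amounts to a grid search over threshold triples. The sketch below is an illustration under assumptions, not the analysis code: the sites and grid values are invented, `passes_all` is a hypothetical helper, and the filter directions (keep sites with QD above its threshold, and with SOR and FS below theirs) are assumed from the usual meaning of these metrics, where higher SOR and FS indicate more strand bias.

```python
from itertools import product


def passes_all(site, qd_min, sor_max, fs_max):
    """Assumed filter directions: high QD is good, high SOR/FS is bad."""
    return site["QD"] > qd_min and site["SOR"] < sor_max and site["FS"] < fs_max


# HYPOTHETICAL sites and threshold grids, for illustration only.
hypothetical_sites = [
    {"QD": 25.0, "SOR": 1.0, "FS": 5.0, "in_truth": True},
    {"QD": 8.0, "SOR": 1.5, "FS": 10.0, "in_truth": True},
    {"QD": 30.0, "SOR": 6.0, "FS": 120.0, "in_truth": False},
    {"QD": 12.0, "SOR": 2.0, "FS": 40.0, "in_truth": False},
]
qd_grid = [5, 10, 20]
sor_grid = [3, 5]
fs_grid = [50, 100]

# Count true and false positives for every threshold combination.
results = {}
for qd_min, sor_max, fs_max in product(qd_grid, sor_grid, fs_grid):
    tp = fp = 0
    for site in hypothetical_sites:
        if passes_all(site, qd_min, sor_max, fs_max):
            if site["in_truth"]:
                tp += 1
            else:
                fp += 1
    results[(qd_min, sor_max, fs_max)] = {"tp": tp, "fp": fp}

for combo, counts in results.items():
    print(combo, counts)
```

Comparing the best-scoring triples from such a grid with the thresholds picked one filter at a time is what allows the two approaches to be checked against each other.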
